Abstract

This paper investigates differentially private analysis of distance-based outliers. The problem
of outlier detection is to find a small number of instances that are apparently distant from
the remaining instances. On the other hand, the objective of differential privacy is to conceal
the presence (or absence) of any particular instance. Outlier detection and privacy protection are
thus intrinsically conflicting tasks. In this paper, instead of reporting the outliers detected, we
present two types of differentially private queries that help to understand the behavior of outliers.
One is the query to count outliers, which reports the number of outliers that appear in a given
subspace. Our formal analysis on the exact global sensitivity of outlier counts reveals that the
regular global-sensitivity-based method can make the outputs too noisy, particularly when the
dimensionality of the given subspace is high. Noting that the counts of outliers are typically
expected to be relatively small compared to the number of data, we introduce a mechanism
based on the smooth upper bound of the local sensitivity. The other is the query to discover the
top-$h$ subspaces containing a large number of outliers. This task can be naively achieved by
issuing count queries to each subspace in turn. However, the number of subspaces can grow
exponentially in the data dimensionality, which can cause serious consumption of the privacy
budget. For this task, we propose an exponential mechanism with a customized score function
for subspace discovery. To the best of our knowledge, this study is the first attempt to ensure
differential privacy for distance-based outlier analysis. We demonstrate our methods with
synthesized datasets and real datasets. The experimental results show that our methods achieve
better utility compared to the global-sensitivity-based methods.

Keywords: Differential privacy, Outlier detection, Smooth sensitivity, Exponential mechanism


1 Introduction 

Machine learning and data mining technologies are now becoming increasingly influential in 
our daily life. When data mining is processed over personal data collected from individuals, the 
acquired knowledge might be used to infer private information. Differential privacy is a recent 
notion of privacy tailored to the problem of releasing statistical information [1]. Differential
privacy for statistical queries of various types, such as average, sum, variance, histogram,
median, and maximum likelihood estimator, has been investigated dElSS.

In this paper, we investigate differentially private outlier analysis. Outlier
detection is the task of identifying instances that are apparently distant from the remaining instances.
The objective of differential privacy is to prevent adversaries from learning of the presence (or
absence) of any particular instance from released information. Outlier detection and privacy
protection are therefore intrinsically conflicting tasks, which makes the problem challenging.

To overcome this difficulty, instead of identifying outliers, we consider reporting statistical 
aggregation on outliers that helps to recognize the occurrence of anomalous situations, with a 
guarantee of differential privacy. More specifically, we examine differentially private queries 
of three types for outlier analysis. The first is a query to count outliers that appear in a given
subspace. The second is a query to discover the top-$h$ subspaces containing numerous outliers.
The third is a query to detect the top-$h$ points that are more likely to be outliers.

1.1 Related Works 

We introduce existing studies of privacy aspects of outlier analysis. Secure multiparty
computation (SMC) is a cryptographic tool that allows multiple parties to jointly evaluate a
specified function over their private inputs while keeping these inputs private. One earlier study JU
introduced an SMC for distance-based outlier detection from horizontally and vertically partitioned
private databases using random shares. One earlier study @ investigated an SMC for spatial
outlier detection. Another study @ presented an SMC for distance-based outlier
detection with the Mahalanobis distance. Another study [7] presented an SMC for density-based
outlier detection. The objective of these works is to detect outliers securely without mutually
sharing privately distributed data; privacy invasion caused by observing detected outliers is not
considered.

Studies of differential privacy for outlier analysis are few, presumably because of its intrinsic
difficulty, as described. Only one report in the literature 0 describes a study that considers
the differential privacy of outlier analysis. That study was conducted to detect anomalous
changes from a time series under a guarantee of differential privacy. Its objective
is closely related to ours; however, its method releases a one-dimensional time series
with differential privacy, and outlier detection is applied to the released data as a post-process.
Consequently, the approach differs from ours.

0 introduced a novel privacy notion, outlier privacy, as a generalization of differential
privacy. Outlier privacy measures an individual's privacy parameter by how much of an "outlier"
the individual is. The objective of that study is to define privacy using the notion of outliers,
not to perform differentially private outlier analysis.

1.2 Our Contribution 

In this paper, we present a methodology for distance-based outlier analysis with a guarantee
of differential privacy. Our proposal consists of two different types of differentially private
queries.

Differentially private counting of outliers. This query reports the number of outliers that
appear in a given subspace. Since the global sensitivity of counts of outliers is very large, the
resulting outputs can be too noisy. We focus on the observation that the counts of outliers are
expected to be relatively small compared to the number of data in typical datasets. Taking
advantage of this, we develop a randomization mechanism for counts of outliers based on
the smooth upper bound of the local sensitivity [2]. Randomization mechanisms based on the
smooth upper bound typically have better utility because of their data dependency; however, their
evaluation is often costly. To alleviate this, we provide an efficient algorithm for evaluation of
the smooth upper bound for counting outliers.

Differentially private discovery of subspaces. This query finds the top-$h$ subspaces containing
a large number of outliers. This task can be naively achieved by issuing count queries to
each subspace in turn. However, the number of subspaces can grow exponentially in the data
dimensionality, which can cause serious consumption of the privacy budget. For this task, we
employ the exponential mechanism. We specifically design a score function for subspace
discovery that is insensitive to the size of the subspace set. Because of this insensitivity, the
proposed mechanism achieves better detection accuracy even with high dimensionality.

To the best of our knowledge, this study is the first attempt to ensure differential privacy for
distance-based outlier analysis. We demonstrate our methods with synthesized datasets and
real datasets. The experimental results show that our methods achieve better utility compared
to the global-sensitivity-based methods.


2 Differential Privacy 

Let $X = \{x_1, x_2, \ldots, x_N\} \in \mathcal{D}^N$ be a database. An analyst issues a query
$f : \mathcal{D}^N \to \mathcal{T}$; then the database returns an output, where $\mathcal{T}$ denotes
the range of the outputs. Differential privacy, a recent notion of privacy, measures the privacy
breach of database $X$ caused by releasing output $t \in \mathcal{T}$ with no assumptions about the
background knowledge of adversaries. The outputs are typically modified using a mechanism
$\mathcal{A} : \mathcal{D}^N \to \mathcal{T}$ before release to preserve differential privacy.

Let $H(X, X') = |\{i : x_i \neq x'_i\}|$ denote the Hamming distance, the number of records that
differ between $X$ and $X'$. If $H(X, X') = 1$, then $X$ and $X'$ are said to be neighbor
databases, written $X \sim X'$ for short. In the following, we presume $|X| = |X'| = N$. Then,
$(\epsilon, \delta)$-differential privacy is defined as shown below.

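Definition 1 ($(\epsilon, \delta)$-Differential Privacy). Mechanism $\mathcal{A}$ guarantees
$(\epsilon, \delta)$-differential privacy if, $\forall X \sim X'$ and $\forall T \subseteq \mathcal{T}$,

\[ \Pr[\mathcal{A}(X) \in T] \leq e^{\epsilon} \Pr[\mathcal{A}(X') \in T] + \delta. \]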

The parameters $\epsilon$ and $\delta$ are designated as privacy parameters. Randomization based on the
global sensitivity is the most straightforward realization of differential privacy for continuous
outputs ICQ. The exponential mechanism is a natural extension for discrete outputs ITOll. We
use both mechanisms for our method; they are explained in detail in the following subsections.

2.1 Sensitivity-based Method 

2.1.1 Global Sensitivity 

Presuming that the output domain of query $f$ is continuous, randomization based on the
global sensitivity [1] provides a mechanism that guarantees differential privacy for queries of
any type, as long as the global sensitivity is evaluable. The global sensitivity is defined as
explained below.

Definition 2 (Global Sensitivity). Letting $\mathcal{D}$ be the domain of data, the global sensitivity of
query $q : \mathcal{D}^N \to \mathbb{R}^d$ is given as

\[ GS_q = \max_{X \sim X'} \| q(X) - q(X') \|_2, \]

where $\| \cdot \|_2$ denotes the $\ell_2$ norm of vectors.

Given the global sensitivity for a specified query, randomization by a normal distribution
based on the global sensitivity guarantees $(\epsilon, \delta)$-differential privacy, as stated by the following
theorem.

Theorem 1 (Gaussian Mechanism by Global Sensitivity IMOil). Let $GS_q$ be the global sensitivity
of a query $q : \mathcal{D}^N \to \mathbb{R}^d$. Then, mechanism $\mathcal{A}$ that randomizes the output of the
query by eq. (1) provides $(\epsilon, \delta)$-differential privacy:

\[ \mathcal{A}_q(X) = q(X) + Y, \tag{1} \]

where $Y \in \mathbb{R}^d$ denotes a noise drawn from the Gaussian distribution with mean 0 and
variance

\[ \frac{GS_q^2 \cdot 2 \log(2/\delta)}{\epsilon^2}. \]
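As a minimal sketch of the mechanism in Theorem 1, the following Python code perturbs a query output with Gaussian noise. It assumes the noise variance $2\,GS_q^2 \log(2/\delta)/\epsilon^2$; the function name `gaussian_mechanism` and its parameters are illustrative and are not taken from the paper.

```python
import math
import random


def gaussian_mechanism(value, global_sensitivity, epsilon, delta, rng=None):
    """Add N(0, sigma^2) noise to every coordinate of `value`.

    Assumption: sigma^2 = 2 * GS^2 * ln(2 / delta) / epsilon^2.
    Returns the noisy vector together with sigma for inspection.
    """
    rng = rng or random.Random()
    sigma = global_sensitivity * math.sqrt(2.0 * math.log(2.0 / delta)) / epsilon
    noisy = [v + rng.gauss(0.0, sigma) for v in value]
    return noisy, sigma


# Hypothetical usage: a one-dimensional count with global sensitivity 3.
noisy, sigma = gaussian_mechanism([12.0], global_sensitivity=3.0,
                                  epsilon=1.0, delta=1e-5,
                                  rng=random.Random(0))
```

Because the standard deviation scales linearly with the global sensitivity, a sensitivity that grows with the dimensionality translates directly into noisier outputs.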


2.1.2 Smooth Sensitivity 

For some functions, the global sensitivity can be impractically large even when the sensitivities
are small for almost all neighboring pairs. This large sensitivity occurs because it is evaluated
as the greatest difference of outputs among all possible neighboring pairs of databases. For
example, the global sensitivity of median is N, the whole sample size, but this arises only in
a pathological situation [2]. Randomization based on the smooth sensitivity enables the use of
moderate sensitivity for such sensitive queries. For a given database $X$, the local sensitivity is
defined as the greatest difference of outputs over all $X'$ s.t. $X' \sim X$.

Definition 3 (Local Sensitivity). Let $\mathcal{D}$ be the domain of the data. Then, the local sensitivity
of query $q : \mathcal{D}^N \to \mathbb{R}^d$ is given as

\[ LS_q(X) = \max_{X' : X' \sim X} \| q(X) - q(X') \|. \]

It is noteworthy that $GS_q = \max_{X \in \mathcal{D}^N} LS_q(X)$. Nissim et al. [2] presented the
smooth sensitivity, which is a class of smooth upper bounds to the local sensitivity.

Definition 4 (Smooth upper bound). For $\beta > 0$, a function $S_{q,\beta} : \mathcal{D}^N \to \mathbb{R}^+$ is a
$\beta$-smooth upper bound on the local sensitivity of query $q$ if it satisfies the following requirements:

\[ \forall X \in \mathcal{D}^N, \quad S_{q,\beta}(X) \geq LS_q(X); \]

\[ \forall X \sim X', \quad S_{q,\beta}(X) \leq e^{\beta} S_{q,\beta}(X'). \]

The smallest function satisfying Definition 4 is the smooth sensitivity of $q$:

Definition 5 (Smooth Sensitivity). Given $\beta > 0$, the smooth sensitivity of query
$q : \mathcal{D}^N \to \mathbb{R}^d$ is

\[ S^*_{q,\beta}(X) = \max_{X' \in \mathcal{D}^N} \left( LS_q(X') \cdot e^{-\beta H(X, X')} \right). \]
Nissim et al. [2] also showed that adding noise proportional to the smooth sensitivity yields a private
output perturbation mechanism if the noise distribution satisfies some properties. The differential
privacy of a Gaussian mechanism realized by the smooth sensitivity can be stated by the
following theorem.

Theorem 2 (Gaussian Mechanism by Smooth Sensitivity [2]). Let $Y$ be a noise generated from
the Gaussian distribution with mean 0 and variance 1. Let $S_{q,\beta}$ be a $\beta$-smooth upper bound
of query $q$. Then, if $\alpha = \frac{\epsilon}{5\sqrt{2\ln(2/\delta)}}$ and
$\beta = \frac{\epsilon}{4(d + \ln(2/\delta))}$, the following mechanism $\mathcal{A}$ guarantees
$(\epsilon, \delta)$-differential privacy:

\[ \mathcal{A}_q(X) = q(X) + \frac{S_{q,\beta}(X)}{\alpha} \cdot Y. \]
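The calibration in Theorem 2 can be sketched as follows for a scalar query. The sketch assumes the parameter choice $\alpha = \epsilon / (5\sqrt{2\ln(2/\delta)})$ and $\beta = \epsilon / (4(d + \ln(2/\delta)))$ given by Nissim et al. for Gaussian noise; the function names are hypothetical, and `smooth_bound` is assumed to be a $\beta$-smooth upper bound computed elsewhere with the same $\beta$.

```python
import math
import random


def smooth_gaussian_params(epsilon, delta, dim):
    """Return (alpha, beta) for Gaussian noise scaled by a smooth bound."""
    log_term = math.log(2.0 / delta)
    alpha = epsilon / (5.0 * math.sqrt(2.0 * log_term))
    beta = epsilon / (4.0 * (dim + log_term))
    return alpha, beta


def smooth_gaussian_mechanism(value, smooth_bound, epsilon, delta, rng=None):
    """Release value + (smooth_bound / alpha) * Y with Y ~ N(0, 1), scalar case."""
    rng = rng or random.Random()
    alpha, _ = smooth_gaussian_params(epsilon, delta, dim=1)
    return value + (smooth_bound / alpha) * rng.gauss(0.0, 1.0)


# Hypothetical usage: an outlier count of 4 with a smooth bound of 2.5.
alpha, beta = smooth_gaussian_params(epsilon=1.0, delta=1e-5, dim=1)
released = smooth_gaussian_mechanism(4.0, 2.5, epsilon=1.0, delta=1e-5,
                                     rng=random.Random(0))
```

Unlike the global-sensitivity variant, the noise scale here depends on the database through `smooth_bound`, which is why the bound itself must vary smoothly between neighboring databases.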


2.2 Exponential Mechanism 

The sensitivity-based method basically presumes that outputs of the target query are real-valued.
The exponential mechanism is a natural extension of the sensitivity-based method
to discrete outputs. Intuitively, the exponential mechanism relies on a utility function
$u : \mathcal{D}^N \times \mathcal{R} \to \mathbb{R}$ that outputs a larger value if the input to the utility
function is close to the true output. With this utility function, output values that are closer to the
true output are more likely to be provided by the exponential mechanism, and vice versa. The
sensitivity of the utility function is defined as presented below.

Definition 6 (Sensitivity of a utility function $u$). Let $\mathcal{D}$ be the domain of data, and let
$u : \mathcal{D}^N \times \mathcal{R} \to \mathbb{R}$ be a utility function. Then, the sensitivity of $u$ is given as

\[ \Delta u = \max_{r \in \mathcal{R}} \; \max_{X, X' \in \mathcal{D}^N : H(X, X') = 1} \| u(X, r) - u(X', r) \|. \]


Given the sensitivity of a utility function, randomization of outputs following Theorem 3
guarantees $\epsilon$-differential privacy.


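Theorem 3 (Exponential Mechanism ifTTl f). Let $\Delta u$ be the sensitivity of utility function
$u : \mathcal{D}^N \times \mathcal{R} \to \mathbb{R}$. Then, mechanism $\mathcal{A}_{\epsilon,u}$ that randomizes the
output of the query by eq. (2) provides $2\epsilon\Delta u$-differential privacy:

\[ \Pr[\mathcal{A}_{\epsilon,u}(X) = t \in \mathcal{R}] = \frac{\exp(\epsilon \cdot u(X, t))}{\int_{\mathcal{R}} \exp(\epsilon \cdot u(X, r)) \, dr}. \tag{2} \]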
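For a finite candidate set, the integral in eq. (2) becomes a sum, and sampling can be sketched as below. This is an illustrative sketch only: the name `exponential_mechanism`, the callable `utility`, and the example scores are assumptions introduced here, and the utility is taken to be already evaluated on the private database.

```python
import math
import random


def exponential_mechanism(candidates, utility, epsilon, rng=None):
    """Sample one candidate with probability proportional to exp(epsilon * utility)."""
    rng = rng or random.Random()
    scores = [epsilon * utility(c) for c in candidates]
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(s - shift) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage: three candidate subspaces scored by a toy utility.
toy_scores = {(0,): 5.0, (1,): 1.0, (0, 1): 4.0}
choice = exponential_mechanism(list(toy_scores), toy_scores.get, epsilon=0.5,
                               rng=random.Random(0))
```

Shifting all scores by their maximum leaves the sampling distribution unchanged, since the common factor cancels in the normalization.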


3 Problem Statement 

Our objective is to analyze outliers contained in a private database in a differentially private
manner. Outlier detection is the problem of identifying points that are significantly distant from
other points. Hence, the result of outlier detection is essentially privacy invasive; privacy
protection and outlier detection have conflicting objectives. To reconcile these conflicting
goals, we investigate two tasks, (1) counting outliers in a given subspace and (2) discovering
subspaces containing many outliers, under the constraint of differential privacy.

Subspace discovery for outlier analysis has been investigated as a major topic of outlier
detection [12, 13, 14]. The major motivation of existing subspace discovery methods was
basically to tackle high dimensionality. Full-space outlier analysis might fail to detect outliers
found only in specific subspaces because of a large number of irrelevant attributes [14, 15].

In this study, we solve the subspace discovery problem in order to balance privacy protection
and outlier analysis. The found subspaces can be interpreted as knowledge to understand
why the points are outliers and how such outliers are generated. After identifying the subspaces
containing a large number of outliers, the numbers of outliers in them are released. Our solutions
presented in the following sections guarantee differential privacy for both tasks.

3.1 Outlier Detection 

In this study, we use distance-based outliers [16]. Presuming that records are real-valued
vectors, $x_i \in \mathbb{R}^d$, and letting $X = \{x_i\}_{i=1}^N$ denote the database, we let
$S \subseteq \{1, 2, \ldots, d\}$ denote a subspace. Following [14], the Euclidean distance between
$x, y \in \mathbb{R}^d$ in subspace $S$ is denoted by

\[ dist_S(x, y) = \sqrt{\frac{\sum_{j \in S} (x_j - y_j)^2}{|S|}}, \]

where $x_j$ and $y_j$ here denote the $j$-th attributes of $x$ and $y$. Then, the set of neighborhood
vectors of $x$ in subspace $S$ is defined as follows.

Definition 7 (Neighboring vector in subspace $S$). Let $r > 0$ and $k \in \{1, \ldots, N\}$. Then, the
set of neighboring vectors of $x$ in subspace $S$ is

\[ N_S(X, r, x) = \{ y \in X \mid dist_S(x, y) \leq r, \; y \neq x \}. \]

With this definition of the neighboring vectors, the outliers are defined as follows. 

Definition 8 (Outliers in subspace $S$). Given threshold $k$ and radius $r$, the set of outliers of $X$
in subspace $S$ is

\[ O_S(X, k, r) = \{ x \in X \mid |N_S(X, r, x)| < k \}. \]

Distance-based outliers can be defined for any type of object on which a distance is
defined. In this study, however, we presume that the objects are represented as real
vectors and that the Euclidean distance is used as the distance definition.
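Definitions 7 and 8 can be illustrated with a small brute-force sketch. It assumes 0-based attribute indices, treats a point at distance exactly $r$ as a neighbor, and excludes a record from its own neighborhood by position rather than by value; all function names are illustrative and the quadratic-time loops are chosen for readability only.

```python
import math


def dist_s(x, y, subspace):
    """Euclidean distance over the attributes in `subspace`, normalized by |S|."""
    squared = sum((x[j] - y[j]) ** 2 for j in subspace)
    return math.sqrt(squared / len(subspace))


def neighbors(X, r, idx, subspace):
    """Indices of records within distance r of record `idx` (itself excluded)."""
    return [j for j in range(len(X))
            if j != idx and dist_s(X[idx], X[j], subspace) <= r]


def outliers(X, k, r, subspace):
    """Indices of records that have fewer than k neighbors in the subspace."""
    return [i for i in range(len(X)) if len(neighbors(X, r, i, subspace)) < k]


def q_count(X, k, r, subspace):
    """Number of distance-based outliers in the subspace."""
    return len(outliers(X, k, r, subspace))


# Toy database: three clustered records and one isolated record.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
full_space = (0, 1)
n_outliers = q_count(X, k=2, r=0.5, subspace=full_space)
```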

3.2 Queries for Outlier Analysis 

As already discussed, we consider two tasks for outlier analysis: outlier count and subspace
discovery.

Let $S \in 2^{\{1, 2, \ldots, d\}}$ be a target subspace. Then, the task of outlier count is to find the number
of outliers in subspace $S$:

\[ q_{count}(X, k, r, S) = |O_S(X, k, r)|. \]

If the subspace is not specified, $O(X, k, r)$ denotes the set of outliers in the full dimension.

Let $\mathcal{S} \subseteq 2^{\{1, 2, \ldots, d\}}$ be a set of subspaces. The task of top-$h$ subspace
discovery is to identify the $h$ subspaces in $\mathcal{S}$ containing the $h$ largest numbers of outliers:

\[ q_{subspace}(X, k, r, h, \mathcal{S}) = \{ S_{\pi(1)}, S_{\pi(2)}, \ldots, S_{\pi(h)} \}, \]

where $\pi : \{1, \ldots, |\mathcal{S}|\} \to \{1, \ldots, |\mathcal{S}|\}$ is a function that outputs the index of
the subspace ordered by $q_{count}(X, k, r, S)$. For example, $\pi(i)$ denotes the index of the subspace
containing the $i$th largest number of outliers.
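Without any privacy protection, the top-$h$ subspace query amounts to ranking subspaces by their outlier counts. The following sketch shows this non-private baseline under the same assumptions as the distance-based definitions above (0-based attribute indices, distance at most $r$ counts as a neighbor); the helper names and the toy data are illustrative.

```python
import math
from itertools import combinations


def q_count(X, k, r, subspace):
    """Number of records with fewer than k neighbors within distance r."""
    def dist(x, y):
        return math.sqrt(sum((x[j] - y[j]) ** 2 for j in subspace) / len(subspace))

    count = 0
    for a, x in enumerate(X):
        degree = sum(1 for b, y in enumerate(X) if a != b and dist(x, y) <= r)
        if degree < k:
            count += 1
    return count


def all_subspaces(d):
    """Every non-empty subset of the d attributes, as tuples of indices."""
    return [s for size in range(1, d + 1) for s in combinations(range(d), size)]


def q_subspace(X, k, r, h, subspaces):
    """The h subspaces with the largest outlier counts (non-private baseline)."""
    ranked = sorted(subspaces, key=lambda s: q_count(X, k, r, s), reverse=True)
    return ranked[:h]


# Toy data in which the last record deviates only in attribute 0.
X = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.0), (0.2, 0.0, 0.1), (9.0, 0.1, 0.1)]
top = q_subspace(X, k=1, r=1.0, h=4, subspaces=all_subspaces(3))
```

The number of candidate subspaces is $2^d - 1$, which illustrates why issuing one private count query per subspace consumes the privacy budget quickly.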
3.3 Differential privacy of Outlier Analysis

We introduce several typical scenarios of differentially private outlier analysis using the two
types of queries, $q_{count}$ and $q_{subspace}$.

Scenario 1. Given threshold $k$ and radius $r$, suppose the objective is to inspect whether
outliers exist in the given dataset. The analyst issues query $z = q_{count}(X, k, r)$, and then
checking $z > 0$ yields the final result. Let $z' = q_{count}(X', k, r)$. For a guarantee of
$(\epsilon, \delta)$-differential privacy, we require, for $\forall X \sim X'$ and $\forall t \in \mathcal{T}$,

\[ \Pr[t = \mathcal{A}(z)] \leq e^{\epsilon} \Pr[t = \mathcal{A}(z')] + \delta. \]

Scenario 2. Let the data dimension be $d = 3$. Given threshold $k$ and radius $r$, suppose the
objective is to identify the two subspaces that contain the largest numbers of outliers and to learn
the number of outliers in the two discovered subspaces. Then, the target subspace set is
$\mathcal{S} = \{\{1\}, \{2\}, \{3\}, \{1, 2\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}\}$. The analyst issues query
$q_{subspace}(X, k, r, 2, \mathcal{S})$ and obtains the two subspaces $z_1 = \{S_{\pi(1)}, S_{\pi(2)}\}$. For each
subspace, the analyst issues queries $z_2 = q_{count}(X, k, r, S_{\pi(1)})$ and
$z_3 = q_{count}(X, k, r, S_{\pi(2)})$. For a guarantee of $(\epsilon, \delta)$-differential privacy, we require,
for $\forall X \sim X'$ and $\forall t \in \mathcal{T}$,

\[ \Pr[t = \mathcal{A}(z_1, z_2, z_3)] \leq e^{\epsilon} \Pr[t = \mathcal{A}(z'_1, z'_2, z'_3)] + \delta, \]

where $z'_1$, $z'_2$, and $z'_3$ are the responses learned from $X' \sim X$.

4 Differentially Private Count of Outliers 

In this section, we investigate the problem of differentially private counting of outliers
in a given subspace. The discussion herein holds for any subspace including the full space.
Therefore, for this discussion, we presume that outliers are counted in the full dimension.

4.1 Difficulties in Global Sensitivity Method 

Analytical evaluation of the global sensitivity of $q_{count}$ is not trivial, partly
because it needs the kissing number. The kissing number $K_d$ is the largest number of
hyperspheres with the same radius in $\mathbb{R}^d$ that can touch an equivalent hypersphere with no
intersections [17, 18, 19]. The kissing numbers in $d = 1$ and $d = 2$ are readily derived respectively
as $K_1 = 2$ and $K_2 = 6$ (see Fig. 2 for $K_2 = 6$). However, finding the kissing number in
$d \geq 3$ is not trivial. In addition, the kissing number in general dimensions remains an open
problem [17, 18, 19]. We derive the upper and lower bounds of the global sensitivity of $q_{count}$
presuming that the kissing number in general dimensions is given.

Theorem 4 (Upper and lower bound on the global sensitivity of $q_{count}$). Let $K_d$ be the kissing
number in $\mathbb{R}^d$. Then, the upper and lower bounds on the global sensitivity of $q_{count}$ are

\[ \min(N, 2dk + 1) \leq GS_{q_{count},d}(k) \leq \min(N, kK_d + 1). \tag{3} \]

Figure 1: The bounds of the global sensitivity for counting outliers (horizontal axis: Dimension).

Proof. The lower bound is trivial, so we omit the proof. We show the proof for the upper
bound. In the problem of the kissing number, suppose the radii of the center hypersphere
and of the hyperspheres touching the center hypersphere (referred to as the surrounding
hyperspheres) are $r/2$. The distance between the center point of the center hypersphere and those
of the surrounding hyperspheres is $r$. Noting that the surrounding hyperspheres do not intersect
one another, the distance between the center points of any two surrounding hyperspheres
is equal to or greater than $r$ (the equality holds if the two surrounding hyperspheres
are touching).

Let $x_0$ be the center of the center hypersphere and $x_1$ be the center of a surrounding
hypersphere that does not touch any other surrounding hypersphere. We further suppose $k - 1$
datapoints exist at exactly the same location as $x_1$, that is, $x_1 = x_2 = \ldots = x_k$. Letting
$X = \{x_0, x_1, \ldots, x_k\}$, $q_{count}(X, k, r) = 0$ because all the $k + 1$ points are within a hypersphere
of radius $r$. If $x_0$ is moved to $x'_0$, giving $X' = \{x'_0, x_1, \ldots, x_k\}$, then
$q_{count}(X', k, r) = k + 1$ holds because the remaining $k$ points no longer have $k$ neighboring
vectors and $x'_0$ itself can be an outlier after being moved.

By definition of the kissing number, the number of surrounding hyperspheres that do
not intersect mutually is at most $K_d$. By applying the setting described above to each surrounding
hypersphere, we have a database that holds $q_{count}(X, k, r) = 0$, but after moving $x_0$,
$q_{count}(X', k, r) = kK_d + 1$. Noting that no more hyperspheres can be packed around $x_0$,
this is the upper bound of the outlier count. □

We empirically investigate the tightness of the bound in low dimensions. In $d = 1$ and $d = 2$,
the global sensitivity is given respectively as $GS_{q_{count},1}(k) = 2k + 1$ and
$GS_{q_{count},2}(k) = 5k + 1$. Noting that $K_1 = 2$ and $K_2 = 6$, the upper bound is tight in
$d = 1$ but not in $d = 2$. Fig. 1 shows the upper and lower bounds of the global sensitivity of
$q_{count}$ evaluated using known upper bounds on the kissing number [17, 18, 19]. As the figure
shows, the upper bound of the global sensitivity grows exponentially with respect to the
dimensionality, which indicates that the guarantee of differential privacy by perturbation based on
the global sensitivity can be
impractical, especially when the dimensionality of the target subspace is large.
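The bounds of eq. (3) can be evaluated directly in the dimensions where the kissing number is known exactly. The following sketch uses the exact values $K_1 = 2$, $K_2 = 6$, $K_3 = 12$, $K_4 = 24$, $K_8 = 240$, and $K_{24} = 196560$; the function name `gs_bounds` is illustrative, and other dimensions would require the upper bounds on the kissing number mentioned above.

```python
# Dimensions in which the kissing number is known exactly.
KISSING_NUMBER = {1: 2, 2: 6, 3: 12, 4: 24, 8: 240, 24: 196560}


def gs_bounds(n, d, k):
    """Lower and upper bounds of eq. (3) on the global sensitivity of q_count."""
    lower = min(n, 2 * d * k + 1)
    upper = min(n, k * KISSING_NUMBER[d] + 1)
    return lower, upper


# With k = 5 and a large database, the gap widens quickly with the dimension.
for d in sorted(KISSING_NUMBER):
    print(d, gs_bounds(n=10**6, d=d, k=5))
```

For $d = 2$ and $k = 5$ this gives the interval from 21 to 31, which contains the exact value $5k + 1 = 26$ stated above.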

The global sensitivity can be prohibitively large simply because the global sensitivity is 
evaluated considering the worst case. However, one can typically expect that the number of 
outliers in the database is much smaller than the number of instances. To improve the utility of 
the count query, we introduce the smooth sensitivity, which is a sensitivity definition depending 
on the database. 
Figure 2: An example of the upper bound of the global sensitivity in two dimensions.
Six surrounding hyperspheres can be packed around the center hypersphere because the
kissing number is $K_2 = 6$. We here suppose $k$ datapoints exist at the center of each surrounding
hypersphere and no datapoint exists at $x_0$, the center of the center hypersphere. Then, $kK_2$ outliers
become inliers by adding a point at $x_0$. Supposing that the added point was an outlier, the added
point can be changed from an outlier to an inlier, too. The upper bound of the global sensitivity for
two dimensions is thus $kK_2 + 1 = 6k + 1$.

4.2 Local Sensitivity and Smooth Sensitivity 

For convenience of discussion later, several notations are introduced here. Given radius $r$,
$deg(X, r, x)$ denotes the size of the neighborhood of $x$:

\[ deg(X, r, x) = |N(X, r, x)|. \]

We say that the degree of $x$ is $k$ if $deg(X, r, x) = k$. The set of vectors in $X$ whose degree is
exactly $k$ is denoted as

\[ V(X, k, r) = \{ x \in X : deg(X, r, x) = k \}. \]

Unless specifically stated otherwise, the radius $r$ and target database $X$ are fixed. Therefore,
they are omitted, as in $deg(x)$ and $V(k)$. Finally, the set of degree-$k$ neighborhoods of $x$ in $X$ is
denoted as

\[ CV(X, x, k, r) = B(x, r) \cap V(k), \]

where $B(x, r)$ denotes the ball with radius $r$ centered at $x$.
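The notations $deg$, $V$, and $CV$ can be made concrete with a brute-force sketch. It assumes the full space with the plain Euclidean distance, counts a point at distance exactly $r$ as a neighbor, and returns record indices rather than vectors; the function names are illustrative.

```python
import math


def euclid(x, y):
    """Plain Euclidean distance in the full space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def degree(X, r, idx):
    """Number of other records within distance r of record `idx`."""
    return sum(1 for j, y in enumerate(X) if j != idx and euclid(X[idx], y) <= r)


def V(X, k, r):
    """Indices of the records whose degree is exactly k."""
    return [i for i in range(len(X)) if degree(X, r, i) == k]


def CV(X, x, k, r):
    """Indices of the degree-k records that lie inside the ball B(x, r)."""
    return [i for i in V(X, k, r) if euclid(X[i], x) <= r]


# One-dimensional toy database with degrees 1, 2, 1, 0 for r = 1.
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
degree_one_near_center = CV(X, (1.0,), k=1, r=1.0)
```

In this toy example, removing the record at 1.0 lowers the degree of both of its neighbors, which is the effect that $CV(X, x, k, r)$ is meant to capture.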

4.2.1 Local Sensitivity 

Given database $X$, let $X_1$ be a database s.t. $H(X, X_1) = 1$. Then, following the definition of
the local sensitivity in Section 2.1.2, the local sensitivity of $q_{count}$ is defined as

\[ LS_{q_{count}}(X, k, r) = \max_{X_1 : H(X, X_1) = 1} \| q_{count}(X, k, r) - q_{count}(X_1, k, r) \|. \]

Exact evaluation of the local sensitivity is intractable. Instead, the following theorem
gives an upper bound on the local sensitivity.

Theorem 5. Given $X$, the local sensitivity of $q_{count}$ for $X$ is bounded above as

\[ LS_{q_{count}}(X, k, r) \leq \max\left\{ \max_{x \in X} |CV(X, x, k, r)|, \; \max_{x \in \mathcal{D}} |CV(X, x, k-1, r)| \right\} + 1. \]



Proof. Intuitively, $CV(X, x, k, r)$ represents the set of non-outliers that become outliers if
$x$ is removed; $CV(X, x, k-1, r)$ is the set of outliers that become inliers if a vector is
placed at $x$. Thus, if vector $x_0 \in X$ is moved to $x'_0$, the number of outliers increases by
$|CV(X, x_0, k, r)|$ by removing $x_0$ and decreases by $|CV(X, x'_0, k-1, r)|$ by adding $x'_0$.
The additional term 1 accounts for the moved vector itself. With this understanding, the local
sensitivity is bounded as:

\[ LS_{q_{count}}(X, k, r) = \max_{X_1 : H(X, X_1) = 1} \| q_{count}(X, k, r) - q_{count}(X_1, k, r) \| \]

\[ \leq \max_{x_0 \in X, \, x'_0 \in \mathcal{D}} \Big| |CV(X, x_0, k, r)| - |CV(X, x'_0, k-1, r)| \Big| + 1 \]

\[ \leq \max_{x_0 \in X, \, x'_0 \in \mathcal{D}} \max\left\{ |CV(X, x_0, k, r)|, \; |CV(X, x'_0, k-1, r)| \right\} + 1 \]

\[ = \max\left\{ \max_{x \in X} |CV(X, x, k, r)|, \; \max_{x' \in \mathcal{D}} |CV(X, x', k-1, r)| \right\} + 1. \]

□


Naive evaluation of the local sensitivity is intractable. An algorithm to evaluate this upper
bound is presented in Section 4.3.

4.2.2 Smooth Sensitivity 

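Given database $X$, let $X_t$ be a database s.t. $H(X, X_t) = t$. By definition, the smooth
sensitivity of $q_{count}$ is given as

\[ S^*_{q_{count},\beta}(X) = \max_{t = 0, 1, \ldots, N} e^{-t\beta} LS^{(t)}_{q_{count}}(X), \tag{4} \]

where

\[ LS^{(t)}_{q_{count}}(X) = \max_{X_t : H(X, X_t) = t} LS_{q_{count}}(X_t). \]

The function $LS^{(t)}_{q_{count}}(X)$ returns the largest local sensitivity among the datasets of
which $t$ records differ from $X$. Similarly to $LS_{q_{count}}(X)$, exact evaluation of
$LS^{(t)}_{q_{count}}(X)$ is intractable because the number of candidates for $X_t$ can increase
exponentially with respect to $t$. Instead, we derive an upper bound on $LS^{(t)}_{q_{count}}(X)$
using $CV(X, x, k, r)$.

Theorem 6. Given $X$, for $t \geq 0$, $LS^{(t)}_{q_{count}}(X)$ is bounded above as

\[ LS^{(t)}_{q_{count}}(X) \leq \max\left\{ \max_{x \in \mathcal{D}} C^{(t)}(X, x, k, r), \; \max_{x \in \mathcal{D}} C^{(t)}(X, x, k-1, r) \right\} + t + 1, \]

where

\[ C^{(t)}(X, x, k, r) = \left| \bigcup_{i=-t}^{t} CV(X, x, k+i, r) \right|. \]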
For the proof of this theorem, we use the following helper lemma.

Lemma 1. Let $t \geq 0$ be an integer, and let $X$ and $X_t$ be databases such that $H(X, X_t) = t$.
Then, for any $x \in \mathcal{D}$, threshold $k$, and radius $r$,

\[ |CV(X_t, x, k, r)| \leq \left| \bigcup_{i=-t}^{t} CV(X, x, k+i, r) \right| + t. \]


Proof of Lemma 1. We first consider the case $t = 1$. Suppose $x \in X$ is moved from $x$ to $x_1$,
and $X_1$ is given as $X_1 = (X \setminus \{x\}) \cup \{x_1\}$. The degree of records in $X \setminus \{x\}$ around $x$ is
decreased by one by removing $x$, and the degree of records in $X \setminus \{x\}$ around $x_1$ is increased
by one by adding $x_1$. Since the degree of the records in $V(X, k+1, r)$ and $V(X, k-1, r)$
may become $k$ in $X_1$, $V(X_1, k, r)$ is thus a subset of
$V(X, k+1, r) \cup V(X, k, r) \cup V(X, k-1, r) \cup \{x_1\}$. When $t > 1$, for the same reason,
$V(X_t, k, r)$ is a subset of $\bigcup_{i=-t}^{t} V(X, k+i, r) \cup \{x_1, x_2, \ldots, x_t\}$, where
$x_1, \ldots, x_t$ are the records moved from $X$ to $X_t$. Thus, the size of
$CV(X_t, x, k, r)$ is bounded above as

\[ |CV(X_t, x, k, r)| \leq \left| B(x, r) \cap \left( \bigcup_{i=-t}^{t} V(X, k+i, r) \cup \{x_1, x_2, \ldots, x_t\} \right) \right| \]

\[ \leq \left| \bigcup_{i=-t}^{t} B(x, r) \cap V(X, k+i, r) \right| + t \]

\[ = \left| \bigcup_{i=-t}^{t} CV(X, x, k+i, r) \right| + t. \]

□


By using Lemma 1, we now prove Theorem 6.

Proof of Theorem 6. As proved in Theorem 5, the local sensitivity of the query $q_{count}$ is
bounded above by

\[ LS_{q_{count}}(X) \leq \max\left\{ \max_{x \in X} |CV(X, x, k, r)|, \; \max_{x \in \mathcal{D}} |CV(X, x, k-1, r)| \right\} + 1. \]

From exchangeability of max, letting

\[ C^{(t)}_{out}(X, k, r) = \max_{X_t : H(X, X_t) = t} \; \max_{x \in X_t} |CV(X_t, x, k, r)|, \quad \text{and} \]

\[ C^{(t)}_{in}(X, k-1, r) = \max_{X_t : H(X, X_t) = t} \; \max_{x \in \mathcal{D}} |CV(X_t, x, k-1, r)|, \]

yields

\[ LS^{(t)}_{q_{count}}(X) \leq \max\left\{ C^{(t)}_{out}(X, k, r), \; C^{(t)}_{in}(X, k-1, r) \right\} + 1. \tag{5} \]

$C^{(t)}_{out}(X, k, r)$ can be bounded above using $C^{(t)}_{in}(X, k, r)$ as

\[ C^{(t)}_{out}(X, k, r) = \max_{X_t : H(X, X_t) = t} \; \max_{x \in X_t} |CV(X_t, x, k, r)| \]

\[ \leq \max_{X_t : H(X, X_t) = t} \; \max_{x \in \mathcal{D}} |CV(X_t, x, k, r)| \]

\[ = C^{(t)}_{in}(X, k, r). \tag{6} \]

Here, we use the fact that $X_t \subseteq \mathcal{D}$. By Lemma 1, we have

\[ C^{(t)}_{in}(X, k, r) \leq \max_{x \in \mathcal{D}} \left| \bigcup_{i=-t}^{t} CV(X, x, k+i, r) \right| + t. \tag{7} \]

By substituting eqs. (6) and (7) into eq. (5), we get the claim. □


4.3 Efficient Computation of Smooth Sensitivity Bound 

For randomization by the mechanism of Theorem 2, it is necessary to evaluate the smooth
upper bound. Naive evaluation of the smooth upper bound of eq. (4) is intractable because it
requires an exhaustive search over a continuous domain to evaluate $LS^{(t)}_{q_{count}}(X)$. To alleviate
this, we first show an efficient algorithm that evaluates the upper bound of $LS^{(t)}_{q_{count}}(X)$
derived in Theorem 6. Then, using this algorithm, we derive the algorithm that calculates the
smooth sensitivity upper bound.


4.3.1 Algorithm for local sensitivity bound 

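To evaluate the upper bound of $LS^{(t)}_{q_{count}}(X)$, we need to calculate

\[ \max_{x \in \mathcal{D}} C^{(t)}(X, x, k, r) = \max_{x \in \mathcal{D}} \left| \bigcup_{i=-t}^{t} V(X, k+i, r) \cap B(x, r) \right| \tag{8} \]

and

\[ \max_{x \in \mathcal{D}} C^{(t)}(X, x, k-1, r) = \max_{x \in \mathcal{D}} \left| \bigcup_{i=-t}^{t} V(X, k+i-1, r) \cap B(x, r) \right|. \tag{9} \]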


Letting $P = \bigcup_{i=-t}^{t} V(X, k+i, r)$ (resp. $P = \bigcup_{i=-t}^{t} V(X, k+i-1, r)$), we can obtain
the value of eq. (8) (resp. eq. (9)) by finding the largest subset $C \subseteq P$ that is enclosed by a
ball with radius $r$. To check whether or not a given subset $C \subseteq P$ is enclosed by the ball, we
use an algorithm that solves the smallest enclosing ball (seb) problem [20]. The goal of the
problem is to find the smallest ball that encloses the given points. The given subset $C \subseteq P$ is
enclosed by a ball with radius $r$ if $seb(C) \leq r$, where $seb(C)$ denotes the radius of the resultant
ball of the smallest enclosing ball problem of $C$.

Algorithm 1 shows the recursive algorithm that calculates eq. (8) or eq. (9) for given
$P = \bigcup_{i=-t}^{t} V(X, k+i, r)$ or $P = \bigcup_{i=-t}^{t} V(X, k+i-1, r)$. $P[i]$ denotes the $i$-th element
of the set $P$. Algorithm 1 searches for the largest subset $C \subseteq P$ that is enclosed by a ball
with radius $r$ by recursively deciding whether each element of $P$ is included, that is, by a
depth-first search. In the algorithm, the calls of seb can be skipped for
efficiency by using the fact that the radius of the enclosing ball of $C_2$ is no smaller than that of $C_1$
if $C_1 \subseteq C_2 \subseteq P$. The computational cost of Algorithm 1 is $O(2^{|P|})$ calls of seb.
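The include-or-exclude recursion described for Algorithm 1 can be sketched in Python as follows. To keep the example self-contained, the smallest enclosing ball is computed only for points on the real line, where its radius is half the range; this one-dimensional `seb_radius_1d` is a simplifying assumption, and a general-dimension solver would be passed in through the `seb` argument. Indices are 0-based here.

```python
def seb_radius_1d(points):
    """Exact smallest-enclosing-ball radius for points on the real line."""
    return (max(points) - min(points)) / 2.0


def largest_enclosed_subset(P, r, seb=seb_radius_1d):
    """Size of the largest subset of P that fits inside some ball of radius r."""

    def search(C, i):
        # If C is already too spread out, no superset of C can be enclosed.
        if C and seb(C) > r:
            return 0
        best = len(C)
        if i < len(P):
            with_next = search(C + [P[i]], i + 1)  # include P[i]
            without_next = search(C, i + 1)        # exclude P[i]
            best = max(best, with_next, without_next)
        return best

    return search([], 0)


# Toy input: three points within a ball of radius 0.5 and two distant points.
P = [0.0, 0.5, 1.0, 5.0, 5.25]
size = largest_enclosed_subset(P, r=0.5)
```

The early return on a non-enclosable subset is what prunes the search; in the worst case, for example when every subset fits in the ball, all $2^{|P|}$ subsets are still visited.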


4.3.2 Algorithm for smooth sensitivity bound

Algorithm 1 costs exponential time with respect to $|P|$, and the size of $P$ increases
monotonically as $t$ increases. However, because of the exponential decrease of $e^{-t\beta}$, the
maximization of $e^{-t\beta} LS^{(t)}_{q_{count}}(X)$ is attained by small $t$ in most cases. Taking account of
this property, we provide Algorithm 2, which calculates the smooth sensitivity bound while avoiding
evaluation of $LS^{(t)}_{q_{count}}(X)$ for large $t$.
Algorithm 1: Calculation of $\max_{x \in \mathcal{D}} C^{(t)}(X, x, k, r)$ (eq. (8) and eq. (9))

Input: Records $P$ and radius $r$.
Output: The value of eq. (8) or eq. (9).
Initialization: $C = \emptyset$ and $i = 1$

Function E($r$, $P$, $C$, $i$)
    $br \leftarrow 0$
    if $C \neq \emptyset$ then
        $br \leftarrow seb(C)$
    end
    if $br \leq r$ then
        $m \leftarrow |C|$
        if $i \leq |P|$ then
            $b_1 \leftarrow$ E($r$, $P$, $C \cup \{P[i]\}$, $i + 1$)
            $b_2 \leftarrow$ E($r$, $P$, $C$, $i + 1$)
            $m \leftarrow \max\{m, b_1, b_2\}$
        end
        return $m$
    else
        return 0
    end
end




Proposition 1. For any t and t' < t, LSqf ount is bounded above as 

LS qlnnt ( X ) ^ ttnin{N , max{U^ (X, k, r), (X, k - l,r)} + t + l}, 

where 

U V(X,k + i,r ) . 

Sketch of proof. For any database $X$, because the number of outliers does not exceed the number
of the records in $X$, the local sensitivity is at most $N$. In addition, using the fact that
$C^{(t)}(X, x, k, r) \subseteq V(X, k, r)$ for any $x \in \mathcal{D}$, we can derive
$\max_{x \in \mathcal{D}} C^{(t)}(X, x, k, r) \le U^{(t)}(X, k, r)$ for any $t$ and $t' < t$. □

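Using the bound in Proposition 1, we have the upper bound of $e^{-t\beta} LS^{(t)}_{q_{count}}(X)$ as

$$e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le e^{-t\beta} \min\{N, \max\{U^{(t)}(X, k, r), U^{(t)}(X, k-1, r)\} + t + 1\}.$$

Letting $S^{T}_{UB}(X) = \max_{i=1,\dots,N-T} S^{T+i}_{UB}(X)$, where each term on the right-hand side is the
bound above evaluated at the corresponding $t$, we can obtain the following proposition.

Proposition 2. If there exists $U_T$ such that $\max_{t=0,\dots,T} e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le U_T$ and
$S^{T}_{UB}(X) \le U_T$, then $S^{*}_{q_{count}}(X) \le U_T$.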


(X, k, r ) = max C^'\X, x, k, r ) + 
Algorithm 2: Calculation of the smooth sensitivity of $q_{count}$

Input: Database X, threshold k, radius r and smooth parameter $\epsilon$.
Output: The smooth sensitivity upper bound of query $q_{count}$ for database X.
Initialization: $S_{max} = 0$ and
    $\max_{x \in \mathcal{D}} C^{(-1)}(X, x, k, r) = \max_{x \in \mathcal{D}} C^{(-1)}(X, x, k-1, r) = 0$.

for t = 0 to N do
    Calculate $S^{t}_{UB}$ by Proposition 2
    if $S^{t}_{UB} < S_{max}$ then
        return $S_{max}$
    end
    $S_{max} \leftarrow \max\{S_{max}, e^{-t\beta} LS^{(t)}_{q_{count}}(X)\}$
    Store $\max_{x \in \mathcal{D}} C^{(t)}(X, x, k, r)$ and $\max_{x \in \mathcal{D}} C^{(t)}(X, x, k-1, r)$ for calculating $S_{UB}$ in
    the next loop
end
return $S_{max}$


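Proof. If $S^{T}_{UB}(X) = \max_{i=1,\dots,N-T} S^{T+i}_{UB}(X) \le U_T$, since $e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le S^{t}_{UB}(X)$
for any $t > T$, we have $e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le U_T$ for all $t > T$. Thus, we have
$\max_{t=0,\dots,T} e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le U_T$ and $\max_{t>T} e^{-t\beta} LS^{(t)}_{q_{count}}(X) \le U_T$, so the maximum
over all $t$ is at most $U_T$. □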

Proposition 2 shows that if $S^{T}_{UB}(X)$ is bounded above by the largest upper bound in Theorem 6
for $t = 0, \dots, T$, then the calculation of the upper bound in Theorem 6 for $t > T$
can be skipped. Algorithm 2 shows the calculation of the smooth sensitivity of $q_{count}$ with this
skip, following Proposition 2.
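The early-stopping idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `ls_at` is a hypothetical callable returning the distance-$t$ local sensitivity, and the tail bound used here is the crude one that follows from the local sensitivity never exceeding $N$, rather than the tighter bound of Proposition 1.

```python
import math


def smooth_bound_with_early_stop(ls_at, n, beta):
    """Return (max_t exp(-t*beta) * ls_at(t), number of ls_at evaluations).

    Assumes 0 <= ls_at(t) <= n for every t, so exp(-t*beta) * n bounds
    every term from step t onward (a cruder bound than Proposition 1).
    """
    s_max = 0.0
    evaluated = 0
    for t in range(n + 1):
        tail_bound = math.exp(-t * beta) * n
        if tail_bound < s_max:
            break  # no later term can exceed the current maximum
        s_max = max(s_max, math.exp(-t * beta) * ls_at(t))
        evaluated += 1
    return s_max, evaluated


if __name__ == "__main__":
    n, beta = 50, 0.5
    toy_ls = lambda t: min(n, 2 + t)  # hypothetical local sensitivity profile
    print(smooth_bound_with_early_stop(toy_ls, n, beta))
```

Because each call of `ls_at` would invoke the exponential-time search of Algorithm 1, stopping after a handful of steps is what makes the computation tractable.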


5 Differentially Private Discovery and Detection 

We are able to obtain the number of outliers in the database while ensuring $(\epsilon, \delta)$-differential
privacy by the previous technique. Next, we consider analyses such as Scenario 2 and Scenario 3.
We describe how to achieve Scenario 2 in Section 5.1 and Scenario 3 in Section ??.

5.1 Top-h Subspace Discovery with Exponential Mechanism

This section investigates differential privacy for finding subspaces that contain outliers. As
already discussed, identification of outliers and protecting the privacy of instances are
intrinsically incompatible. However, if the interest of the analyst is simply to learn the situations in
which the outliers appear, we can avoid releasing the outlier counts on each subspace; releasing
subspaces containing many outliers would suffice [12]. In this section, we present another
mechanism for differential privacy that allows us to learn the top-h subspaces that contain
outliers.

The subspaces containing many outliers can be simply found by issuing count queries for
each subspace. However, responses obtained with such a procedure can be useless in typical
settings. Let $\mathcal{F}_c$ be the set of subspaces spanned by $c$ dimensions. Then, the privacy parameters
need to be set as $(\epsilon/|\mathcal{F}_c|, \delta/|\mathcal{F}_c|)$ for each count query so that the entire process achieves
$(\epsilon, \delta)$-differential privacy because of sequential composition. The privacy budget can be saved by
applying the more sophisticated composition theorem [21]. However, noting that the size of
$\mathcal{F}_c$ can grow exponentially in $d$, it is still difficult to manage high dimensionality.
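To make the budget consumption concrete, the following sketch computes the number of $c$-dimensional subspaces and the per-query parameters under plain sequential composition. The function name and the example values of $d$ and $c$ are illustrative assumptions.

```python
from math import comb


def per_query_budget(eps, delta, d, c):
    """Split (eps, delta) evenly over all c-dimensional subspaces of d attributes."""
    n_subspaces = comb(d, c)  # number of ways to choose c of the d dimensions
    return n_subspaces, eps / n_subspaces, delta / n_subspaces


if __name__ == "__main__":
    # d = 34 matches the Ionosphere dimensionality; c = 5 is an arbitrary example.
    print(per_query_budget(1.0, 0.01, d=34, c=5))
```

With 34 attributes and 5-dimensional subspaces there are already 278256 count queries, so each query would receive only a tiny fraction of the overall budget.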

We consider the problem of the top-h subspace discovery by means of the exponential
mechanism, which allows us to avoid releasing outlier counts for each subspace. Let $d$ be the
instance dimension and let $E = \{1, 2, \dots, d\}$. Then, the set of $c$-dimensional subspaces is
denoted as $\mathcal{F}_c = \{F_c \mid F_c \subseteq E, |F_c| = c\}$. Given $c$ and $E$, the top-h subspace discovery is the
problem of finding the $h$ subspaces in $\mathcal{F}_c$ containing the $h$ largest numbers of outliers. The
exponential mechanism can be used to release discrete values while achieving differential privacy.
We employ the following function as the utility function for the top-h subspace discovery:

$$u_{subspace}(X, k, r, S) = \frac{q_{count}(X, k, r, S)}{GS^{UB}_{q_{count},|S|}(k)},$$

where $GS^{UB}_{q_{count},|S|}(k)$ denotes the upper bound of the global sensitivity of the count query,
as derived in Section 4.1. The following theorem states the differential privacy
achieved by the exponential mechanism with this utility function.

Theorem 7. The exponential mechanism with utility function $u_{subspace}$ achieves $2\epsilon$-differential
privacy.


Proof. The global sensitivity of utility function $u_{subspace}$ is given as

$$\begin{aligned}
\Delta u_{subspace} &= \max_{S} \max_{X, X' \in \mathcal{D}^n : H(X, X') = 1} \left\| u_{subspace}(X, k, r, S) - u_{subspace}(X', k, r, S) \right\| \\
&= \max_{S} \max_{X, X' \in \mathcal{D}^n : H(X, X') = 1} \frac{\left\| q_{count}(X, k, r, S) - q_{count}(X', k, r, S) \right\|}{GS^{UB}_{q_{count},|S|}(k)} \\
&= \max_{S} \frac{GS_{q_{count},|S|}(k)}{GS^{UB}_{q_{count},|S|}(k)} \\
&\le 1.
\end{aligned}$$

Hence, we have $\Delta u_{subspace} \le 1$. The exponential mechanism with utility function $u_{subspace}$
thus achieves $2\epsilon \Delta u_{subspace}$-differential privacy, which concludes the proof. □

To obtain the top-h (suspected) subspaces, we need to iterate the exponential mechanism
until h different subspaces are found. Algorithm 3 describes the entire procedure for the top-h
query with $\epsilon$-differential privacy, so the top-h subspace discovery query can be handled
by it. At lines 2-4, the utility of each subspace in $\mathcal{F}_c$ is evaluated. At lines 5-10, h
subspaces are chosen by iterative application of the exponential mechanism with utility function
$u_{subspace}$. Finally, the selected subspaces are released. Note that the privacy parameter for the
exponential mechanism is set to $\epsilon/h$ so that the entire procedure of the top-h subspace discovery
achieves $\epsilon$-differential privacy.
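The iterative selection described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the weights use the textbook exponential-mechanism form $\exp(\epsilon' u / (2 \Delta u))$ with $\epsilon' = \epsilon/h$, which may differ from the paper's normalization by a constant factor, and `utilities` is a hypothetical precomputed mapping from each subspace to its utility.

```python
import math
import random


def top_h_exponential(utilities, h, eps, delta_u=1.0, rng=None):
    """Pick h distinct items by repeated exponential-mechanism draws.

    utilities: dict mapping item -> utility score (assumed precomputed).
    delta_u:   sensitivity of the utility (Theorem 7 bounds it by 1).
    """
    if h > len(utilities):
        raise ValueError("h cannot exceed the number of candidate items")
    rng = rng or random.Random()
    items = list(utilities)
    eps_step = eps / h  # budget per draw under sequential composition
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = [
        math.exp(eps_step * (utilities[s] - top) / (2.0 * delta_u)) for s in items
    ]
    chosen = []
    while len(chosen) < h:
        r = rng.choices(items, weights=weights, k=1)[0]
        if r not in chosen:  # redraw until a new item appears
            chosen.append(r)
    return chosen


if __name__ == "__main__":
    scores = {(j,): (5.0 if j < 2 else 0.0) for j in range(10)}
    print(top_h_exponential(scores, h=2, eps=1.6, rng=random.Random(0)))
```

Dividing the budget by $h$ flattens the sampling distribution as $h$ grows, which is the source of the extra noise observed for larger $h$ in the experiments.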


6 Experiments 

In this section, we show the empirical evaluation of the utility of the mechanisms for the
outlier counting query, the subspace discovery query, and the outlier detection query.
Algorithm 3: Mechanism of Top-h Query

Input: Top-h query $q^h$ and smooth parameter $\epsilon$.
Output: The top-h items $R_h$.
Initialization: $R_h \leftarrow \emptyset$

 1  Function $q^h(R, h, \cdot)$
 2      for r ∈ R do
 3          calculate the utility of an item r by $u_h(r, \cdot)$
 4      end
 5      while $|R_h| < h$ do
 6          repeat
 7              sample r by the exponential mechanism with utility $u_h$, sensitivity $\Delta u_h$, and parameter $\epsilon/h$
 8          until $r \notin R_h$
 9          $R_h \leftarrow \{r\} \cup R_h$
10      end
11      return $R_h$
12  end


6.1 Settings 

We conducted the experiments on synthetic and real datasets for Scenario 1 and Scenario
2. As real datasets, we used the Adult and Ionosphere datasets chosen from the UCI Machine Learning
Repository [22], which were originally prepared for classification tasks. To adapt them to outlier
analysis, we preprocessed the datasets in the same manner as [23, 24]. These
datasets were scaled so that the mean and variance of each attribute are 0 and 1, respectively.
For Adult, we removed two categorical attributes, “category” and “fnlwgt”.

The experiments for Scenario 1 were carried out on two datasets, named Synthetic 1
and Adult 1. Synthetic 1 consists of 50 samples of 2-dimensional real vectors, which contain
45 inliers and 5 outliers. The inliers are sampled from $\mathcal{N}(0, I)$, where $I$ represents an identity
matrix, and the outliers are sampled from $\mathcal{N}(\mu, \Sigma)$, where $\mu_1 = \mu_2 = 20$ and $\Sigma$ is a diagonal
matrix such that $\Sigma_{11} = \Sigma_{22} = 100$. Adult 1 is a subset of the original Adult dataset which
contains 45 positive labeled samples and 5 negative labeled samples. The positive labeled
samples and the negative labeled samples are treated as inliers and outliers, respectively (see
Table 1 for the details).

The experiments for Scenario 2 were conducted on three datasets, named Synthetic 2,
Adult 2 and Ionosphere. Synthetic 2 consists of 500 samples of 10-dimensional real vectors.
The dataset contains 490 inliers sampled from $\mathcal{N}(0, I)$ and 10 outliers sampled from $\mathcal{N}(\mu, \Sigma)$,
where $\mu_1 = \mu_2 = 20$, $\mu_i = 0$ for $i = 3, \dots, 10$, $I$ represents an identity matrix, and $\Sigma$ is
a diagonal matrix such that $\Sigma_{11} = \Sigma_{22} = 100$ and $\Sigma_{ii} = 1$ for $i = 3, \dots, 10$. Adult 2 and
Ionosphere are subsets of the original Adult and Ionosphere datasets, which contain 490 positive
labeled samples and 10 negative labeled samples in Adult 2, and 225 positive labeled samples
and 10 negative labeled samples in Ionosphere. The treatment of inliers and outliers in the real
datasets is the same as in the experiments for Scenario 1 (see Table 2 for the details).
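The synthetic datasets described above could be generated along the following lines. This is a sketch only: the function name, the seed handling, and the use of Python's standard `random` module are assumptions, and the original experiments may have used a different generator.

```python
import random


def make_synthetic(n_inliers, n_outliers, d, shifted_dims=2, seed=0):
    """Inliers ~ N(0, I); outliers have mean 20 and variance 100 on the first
    `shifted_dims` coordinates and are standard normal on the remaining ones."""
    rng = random.Random(seed)
    inliers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_inliers)]
    outliers = []
    for _ in range(n_outliers):
        row = []
        for j in range(d):
            if j < shifted_dims:
                row.append(rng.gauss(20.0, 10.0))  # std 10 gives variance 100
            else:
                row.append(rng.gauss(0.0, 1.0))
        outliers.append(row)
    return inliers, outliers


if __name__ == "__main__":
    syn1_in, syn1_out = make_synthetic(45, 5, d=2)     # Synthetic 1
    syn2_in, syn2_out = make_synthetic(490, 10, d=10)  # Synthetic 2
    print(len(syn1_in), len(syn1_out), len(syn2_in), len(syn2_out))
```

In Synthetic 2 only the first two coordinates separate the outliers from the inliers, so only those one-dimensional subspaces are expected to contain many outliers.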
Table 1: Summary of datasets and parameters for Scenario 1

                            Synthetic 1   Adult 1
The number of outliers      5             5
The number of inliers       45            45
The number of samples N     50            50
Dimension d                 2             7
Threshold k                 3             3
Radius r                    1.1           0.35


Table 2: Summary of datasets and parameters for Scenario 2 and Scenario 3

                            Synthetic 2   Adult 2   Ionosphere
The number of outliers      10            10        10
The number of inliers       490           369       225
The number of samples N     500           379       235
Dimension d                 10            7         34
Threshold k                 3             3         3
Radius r                    0.13          0.02      0.06


6.2 Count Outliers 

Following Scenario 1 described in Section 3.3, we evaluated the utility of the mechanisms
of $q_{count}$ on the synthetic dataset. We changed the privacy parameter from $\epsilon = 0.1$ to 0.9;
$\delta$ was fixed as $\delta = 0.01$. See Table 1 for the parameters of the outliers. We partitioned the
instances into two classes: one is “true”, indicating the instance detected as an outlier; the
other is “false”. For each dataset, we tuned the radius $r$ so that the Accuracy given by eq. (10)
is maximized:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad (10)$$

where TP, TN, FP and FN respectively denote true positive, true negative, false positive,
and false negative. For implementation, we used [25] to solve the smallest enclosing ball
problem. As the criterion of the utility of the mechanisms, we show the standard deviation
of the noise added to the query. We compared the standard deviation of the noise of the
mechanism based on the smooth sensitivity upper bound with that of the mechanism based
on the global sensitivity lower bound. Fig. 3 and Fig. 4 show the true number of
outliers in the database and the standard deviations ($\sigma_{Global}$ and $\sigma_{Smooth}$) of the Gaussian noise for
each $\epsilon$. In Fig. 3 and Fig. 4, “Global” and “Smooth” respectively represent the global
sensitivity-based mechanism and the smooth sensitivity-based mechanism.

It is apparent that the standard deviation of the noise of the smooth sensitivity-based
mechanism is significantly lower than that of the global sensitivity-based mechanism. Indeed, the
standard deviation of the noise of the global sensitivity-based mechanism is approximately 10-30
times larger than that of the smooth sensitivity-based mechanism, even though the global
sensitivity-based mechanism uses the lower bound. In addition, the smooth sensitivity-based
mechanism achieves noise whose standard deviation is lower than 7 for $\epsilon > 0.7$ on each
dataset. These results arise because our approach depends only on the number
of outliers, not on the number of dimensions. From these results, we can conclude that our
framework is sufficiently practical in this setting.

Figure 3: The result of Synthetic 1 on Scenario 1

Figure 4: The result of Adult 1 on Scenario 1


6.3 Top-h Subspace Discovery

The experiments of top-h subspace discovery shown in this subsection follow Scenario 2 of
Section 3.3. The analyst investigates the subspaces that contain more outliers using query $q_{subspace}$.
In these experiments, the dimensionality of the subspace is set as 1; the analyst tries to detect
2 out of 10 subspaces by top-h subspace discovery.

For evaluation purposes, we partitioned the subspaces into two classes: one is “true”,
indicating the subspace containing outliers; the other is “false”. The utility of the results is
measured by the precision and recall. The precision is evaluated by $\mathrm{precision} = \frac{TP}{TP + FP}$,
where TP and FP respectively denote true positive and false positive. The recall is evaluated
by $\mathrm{recall} = \frac{TP}{TP + FN}$, where FN denotes false negative. The precision and recall are
averaged over one thousand trials. The privacy parameter was varied from $\epsilon = 0.2$ to 3.2.
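For concreteness, the two measures can be computed for a single trial as in the sketch below; the function name and the representation of subspaces as hashable items are illustrative assumptions. Averaging the returned pairs over repeated trials gives the reported values.

```python
def precision_recall(selected, true_subspaces):
    """Precision and recall of a set of selected subspaces for one trial."""
    selected, true_subspaces = set(selected), set(true_subspaces)
    tp = len(selected & true_subspaces)   # true subspaces that were selected
    fp = len(selected - true_subspaces)   # selected but not true
    fn = len(true_subspaces - selected)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Two true subspaces (dimensions 0 and 1) and h = 4 selected subspaces.
    print(precision_recall({0, 1, 2, 3}, {0, 1}))
```

With two true subspaces, selecting more than two items can raise the recall but necessarily caps the precision at $2/h$.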

The left and right panels of the figures for Scenario 2 respectively represent the precision and
recall for varying $h$, the number of subspaces to be detected. The precision decreases as $h$ grows,
as shown in Fig. 5. The recall can be improved with larger $h$ because the probability with which true
subspaces are chosen increases. Because of sequential composition, however, the outputs of the
exponential mechanism become noisy as $h$ increases. Therefore, the recall can decrease if
the effect of noise is dominant. As Fig. 5 shows, the effect of sequential composition was
more dominant and smaller $h$ achieved larger recall in these experiments. However, there is no
distinctive subspace that has many outliers. It is difficult to apply top-h subspace discovery
when the differences in the numbers of outliers among subspaces are not distinctive.

Figure 5: The result of Synthetic 2 on Scenario 2

Figure 6: The result of Adult 2 on Scenario 2


For practical use, the precision and recall should be much higher than 1/2. If the
number of true subspaces is known to analysts in advance, then $h$ should be set as small as
possible. The privacy parameter $\epsilon$ and the utility (precision and recall) are in a tradeoff relation.
Noting that the objective of outlier analysis fundamentally conflicts with privacy protection, the
choice of larger $\epsilon$, such as $0.8 < \epsilon < 1.6$, might be allowed.


7 Conclusion 

In this paper, we present a differentially private distance-based outlier analysis that consists
of two different types of queries: the differentially private counting of outliers in a given
subspace and the differentially private discovery of subspaces.

For the query of counting outliers, by taking advantage of the smooth sensitivity, the
resulting output of the mechanism can be less noisy than that of the global sensitivity based
mechanism. Although the evaluation of the smooth upper bound is often costly, we provide
an efficient algorithm for evaluating the smooth upper bound for the outlier counting
problem. This paper describes an initial step towards differentially private outlier analysis, and
the experimental evaluation is performed with relatively small-size datasets. In our algorithm,
we invoke the smallest enclosing ball algorithm that takes as input the power set of instances.
Because of this construction, we need a more efficient algorithm for application to larger
datasets.

Figure 7: The result of Ionosphere on Scenario 2

For the query of discovery of subspaces, we employ the exponential mechanism and
specifically design a utility function. Even though the number of subspaces can grow exponentially
in the data dimensionality, the proposed mechanism achieves better detection accuracy for high
dimensionality.